We analyze two variables to find out whether there is some association between them. Even when a relationship is difficult to identify conclusively, bivariate analysis can still reveal signs of association that at least merit further attention.
As before, the nature of the data allows for some particular analytical techniques, while also providing some limitations to our inferences. Let’s see what we can visualize with the different combinations of data.
This time, I will use crime data from the Seattle Open Data portal:
link="https://github.com/EvansDataScience/data/raw/master/crime.RData"
load(file = url(link))
The data available are:
names(crime)
## [1] "Report.Number" "Occurred.Date"
## [3] "year" "month"
## [5] "weekday" "Occurred.Time"
## [7] "Occurred.DayTime" "Reported.Date"
## [9] "Reported.Time" "DaysToReport"
## [11] "crimecat" "Crime.Subcategory"
## [13] "Primary.Offense.Description" "Precinct"
## [15] "Sector" "Beat"
## [17] "Neighborhood"
A quick look will give us:
head(crime)
## Report.Number Occurred.Date year month weekday Occurred.Time
## 1 20130000244104 2013-07-09 2013 7 Tuesday 1930
## 2 20130000242916 2013-07-09 2013 7 Tuesday 1917
## 3 20130000904541 2013-07-09 2013 7 Tuesday 1900
## 4 20130000243875 2013-07-09 2013 7 Tuesday 1900
## 5 20130000242883 2013-07-09 2013 7 Tuesday 1846
## 6 20130000242880 2013-07-09 2013 7 Tuesday 1844
## Occurred.DayTime Reported.Date Reported.Time DaysToReport
## 1 evening 2013-07-10 1722 1
## 2 evening 2013-07-09 2052 0
## 3 evening 2013-07-10 35 1
## 4 evening 2013-07-10 1258 1
## 5 evening 2013-07-09 1846 0
## 6 evening 2013-07-09 1849 0
## crimecat Crime.Subcategory
## 1 NARCOTIC NARCOTIC
## 2 BURGLARY BURGLARY-RESIDENTIAL
## 3 CAR PROWL CAR PROWL
## 4 THEFT MOTOR VEHICLE THEFT
## 5 FAMILY OFFENSE-NONVIOLENT FAMILY OFFENSE-NONVIOLENT
## 6 BURGLARY BURGLARY-RESIDENTIAL
## Primary.Offense.Description Precinct Sector Beat Neighborhood
## 1 NARC-FRAUD-PRESCRIPTION NORTH U U3 SANDPOINT
## 2 BURGLARY-NOFORCE-RES NORTH L L3 LAKECITY
## 3 THEFT-CARPROWL SOUTH R R2 MOUNT BAKER
## 4 VEH-THEFT-AUTO NORTH U U3 ROOSEVELT/RAVENNA
## 5 CHILD-OTHER WEST Q Q2 QUEEN ANNE
## 6 BURGLARY-FORCE-RES WEST Q Q2 QUEEN ANNE
Let’s see what kind of data we have:
str(crime,width = 70,strict.width='cut')
## 'data.frame': 499698 obs. of 17 variables:
## $ Report.Number : chr "20130000244104" "201300002429"..
## $ Occurred.Date : Date, format: "2013-07-09" ...
## $ year : num 2013 2013 2013 2013 2013 ...
## $ month : num 7 7 7 7 7 7 7 7 7 7 ...
## $ weekday : Ord.factor w/ 7 levels "Monday"<"Tu"..
## $ Occurred.Time : num 1930 1917 1900 1900 1846 ...
## $ Occurred.DayTime : Ord.factor w/ 4 levels "day"<"after"..
## $ Reported.Date : Date, format: "2013-07-10" ...
## $ Reported.Time : num 1722 2052 35 1258 1846 ...
## $ DaysToReport : num 1 0 1 1 0 0 0 0 1 0 ...
## $ crimecat : Factor w/ 20 levels "AGGRAVATED ASS"..
## $ Crime.Subcategory : Factor w/ 30 levels "AGGRAVATED ASS"..
## $ Primary.Offense.Description: Factor w/ 144 levels "ADULT-VULNERA"..
## $ Precinct : Factor w/ 5 levels "EAST","NORTH",....
## $ Sector : Factor w/ 23 levels "6804","9512",....
## $ Beat : Factor w/ 64 levels "B1","B2","B3",...
## $ Neighborhood : Factor w/ 58 levels "ALASKA JUNCTIO"..
The main way to organize these relationships is the contingency table. Let’s select a couple of categorical variables:
(CrimeTotal=table(crime$crimecat,crime$Occurred.DayTime))
##
## day afternoon evening night
## AGGRAVATED ASSAULT 3564 5366 4884 7501
## ARSON 196 167 191 486
## BURGLARY 24139 22288 14121 16082
## CAR PROWL 26740 38273 42595 34839
## DISORDERLY CONDUCT 41 81 67 79
## DUI 706 939 2038 8522
## FAMILY OFFENSE-NONVIOLENT 1748 2516 1217 1120
## GAMBLE 4 4 7 2
## HOMICIDE 41 46 49 131
## LIQUOR LAW VIOLATION 112 491 410 606
## LOITERING 20 31 25 9
## NARCOTIC 2415 6416 3924 4109
## PORNOGRAPHY 65 53 17 31
## PROSTITUTION 115 675 1425 1340
## RAPE 332 318 354 854
## ROBBERY 2584 4737 4139 5372
## SEX OFFENSE-OTHER 1501 1759 1014 1776
## THEFT 38687 64868 38980 28410
## TRESPASS 4848 5184 2598 3289
## WEAPON 735 1445 947 1624
The table above shows counts, but in most situations, it is important to see relative values:
# using "pipes" to help readability:
library(magrittr)
(CrimeTotal=table(crime$crimecat,crime$Occurred.DayTime)%>% #create table and then...
prop.table() %>% #compute proportion and then...
"*"(100)%>% # multiply by 100 and then...
round(2) #...round to two decimals
)
##
## day afternoon evening night
## AGGRAVATED ASSAULT 0.71 1.07 0.98 1.50
## ARSON 0.04 0.03 0.04 0.10
## BURGLARY 4.83 4.46 2.83 3.22
## CAR PROWL 5.35 7.66 8.53 6.98
## DISORDERLY CONDUCT 0.01 0.02 0.01 0.02
## DUI 0.14 0.19 0.41 1.71
## FAMILY OFFENSE-NONVIOLENT 0.35 0.50 0.24 0.22
## GAMBLE 0.00 0.00 0.00 0.00
## HOMICIDE 0.01 0.01 0.01 0.03
## LIQUOR LAW VIOLATION 0.02 0.10 0.08 0.12
## LOITERING 0.00 0.01 0.01 0.00
## NARCOTIC 0.48 1.28 0.79 0.82
## PORNOGRAPHY 0.01 0.01 0.00 0.01
## PROSTITUTION 0.02 0.14 0.29 0.27
## RAPE 0.07 0.06 0.07 0.17
## ROBBERY 0.52 0.95 0.83 1.08
## SEX OFFENSE-OTHER 0.30 0.35 0.20 0.36
## THEFT 7.75 12.99 7.80 5.69
## TRESPASS 0.97 1.04 0.52 0.66
## WEAPON 0.15 0.29 0.19 0.33
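A note on the pipe chain above: piping into `"*"(100)` simply calls the multiplication operator as a function, so `x %>% "*"(100)` is the same as `x * 100`. A minimal sketch with toy numbers:

```r
library(magrittr)

x <- c(0.25, 0.5)
# piping into the multiplication operator is equivalent to x * 100
pct <- x %>% "*"(100) %>% round(2)
pct  # 25 50
```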
Those tables show total counts or percentages. However, when a table is meant to suggest a relationship, the independent variable should go in the columns and the dependent one in the rows; the percentages should then be calculated by column, so that you can see how the levels of the dependent variable vary within each level of the independent one, and compare along the rows.
CrimeCol=table(crime$crimecat,crime$Occurred.DayTime)%>%
prop.table(margin = 2)%>% # 2 is % by column
"*"(100)%>%
round(3)
CrimeCol
##
## day afternoon evening night
## AGGRAVATED ASSAULT 3.282 3.447 4.104 6.456
## ARSON 0.180 0.107 0.161 0.418
## BURGLARY 22.229 14.319 11.866 13.842
## CAR PROWL 24.624 24.588 35.794 29.987
## DISORDERLY CONDUCT 0.038 0.052 0.056 0.068
## DUI 0.650 0.603 1.713 7.335
## FAMILY OFFENSE-NONVIOLENT 1.610 1.616 1.023 0.964
## GAMBLE 0.004 0.003 0.006 0.002
## HOMICIDE 0.038 0.030 0.041 0.113
## LIQUOR LAW VIOLATION 0.103 0.315 0.345 0.522
## LOITERING 0.018 0.020 0.021 0.008
## NARCOTIC 2.224 4.122 3.297 3.537
## PORNOGRAPHY 0.060 0.034 0.014 0.027
## PROSTITUTION 0.106 0.434 1.197 1.153
## RAPE 0.306 0.204 0.297 0.735
## ROBBERY 2.380 3.043 3.478 4.624
## SEX OFFENSE-OTHER 1.382 1.130 0.852 1.529
## THEFT 35.626 41.674 32.756 24.453
## TRESPASS 4.464 3.330 2.183 2.831
## WEAPON 0.677 0.928 0.796 1.398
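A quick sanity check, on a toy table (made-up counts, not the crime data), that `prop.table(margin = 2)` makes each column sum to 100 after scaling:

```r
# toy contingency table: 4 made-up observations
toy <- table(rep(c("A", "B"), times = c(3, 1)),
             rep(c("day", "night"), 2))
colPct <- round(prop.table(toy, margin = 2) * 100, 1)
colSums(colPct)  # each column adds up to 100
```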
The complexity of two variables calls for plots, as tables like these will not let you discover association patterns easily, even though they already summarize the two columns. However, you must check what data format the plotting functions require, as most plots take the contingency table as input (not the raw data).
As before, we can use the bar plot with the contingency table as input:
barplot(CrimeCol)
This plot would need a lot of work, so the base capabilities of R may not be a good strategy here.
When using alternative or more specialized plotting tools, you may need to convert your table into a data frame of frequencies. Let me start from the table of proportions over the total:
df.T=as.data.frame(CrimeTotal) # table of proportion based on total
# YOU GET:
head(df.T)
## Var1 Var2 Freq
## 1 AGGRAVATED ASSAULT day 0.71
## 2 ARSON day 0.04
## 3 BURGLARY day 4.83
## 4 CAR PROWL day 5.35
## 5 DISORDERLY CONDUCT day 0.01
## 6 DUI day 0.14
We should rename the above table:
names(df.T)=c('Crime','Daytime','Percent') #renaming
head(df.T)
## Crime Daytime Percent
## 1 AGGRAVATED ASSAULT day 0.71
## 2 ARSON day 0.04
## 3 BURGLARY day 4.83
## 4 CAR PROWL day 5.35
## 5 DISORDERLY CONDUCT day 0.01
## 6 DUI day 0.14
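As an alternative to renaming after the fact, `as.data.frame()` for tables accepts a `responseName` argument, and the dimension names can be set on the table itself. A sketch on a toy table (not the crime data):

```r
toy <- table(c("A", "A", "B"), c("x", "y", "x"))
names(dimnames(toy)) <- c("Crime", "Daytime")       # label the dimensions first
df <- as.data.frame(toy, responseName = "Percent")  # name the frequency column directly
names(df)  # "Crime" "Daytime" "Percent"
```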
A first option you may have is to reproduce the table:
library(ggplot2)
base = ggplot(df.T, aes(Daytime,Crime))
# plot value as point, size by value of percent
tablePlot1 = base + geom_point(aes(size = Percent), colour = "gray")
# add value of Percent as label, move it
tablePlot2 = tablePlot1 + geom_text(aes(label = Percent),
nudge_x = 0.1,
size=2)
tablePlot2
…some more work:
tablePlot3 = tablePlot2 + scale_size_continuous(range=c(0,10)) #change 10?
tablePlot4 = tablePlot3 + theme_minimal() # less ink
tablePlot4 + theme(legend.position="none") # no legend
The plot looks nice, but unless the differences are clear-cut, you may see more noise than information, which distracts and delays decision making. Keep in mind that lengths of bars are easier to compare than areas of circles. You need a bar plot, but built with better tools:
base = ggplot(df.T, aes(x = Crime, y = Percent ) )
bars1 = base + geom_bar( stat = "identity" ) + theme_minimal()
# bar per day time with 'facet'
bars2 = bars1 + facet_wrap( ~ Daytime ,nrow = 1)
bars2
…some more work:
# change the minimal theme
bars3 = bars2 + theme( axis.text.x = element_text(angle = 90,
hjust = 1,
size=3 ) )
bars3
And the column-percentage table, the one that encodes the hypothesized input-output relationship, can be plotted like this:
df.C=as.data.frame(CrimeCol)
colnames(df.C)=c('Crime','Daytime','Percent')
#####
base = ggplot(df.C, aes(x = reorder(Crime, Percent), y = Percent ) )
bars1 = base + geom_segment(aes(y = 0, x = reorder(Crime, Percent), yend = Percent, xend = Crime), stat = 'identity', color = "black")
bars2 = bars1 + facet_wrap( ~ Daytime ,nrow = 1) + geom_point()
bars3 = bars2 + coord_flip() + theme(axis.text.y = element_text(size=6,angle = 0)) + labs(x = "Total Percentage Per Daytime", y = "Percentage", title = "Percentage of Crime Based on Daytime", caption = "Source: Seattle Open Data Portal")
bars4 = bars3 + theme(panel.background = element_rect(fill = 'gray97'), plot.title=(element_text(hjust=0.5)), plot.caption=element_text(vjust = 0.1), axis.title.x = element_text(vjust = -1), axis.title.y = element_text(vjust = 2, hjust = .5))
bars4
# The type of crime is not ordinal, so we reordered the bars by percentage.
Exercise 1:
Turn the bars into lollipop with the right components.
Once you see a plot of bivariate categorical data, you may consider other plots:
# heatplot
base = ggplot(df.C, aes(x = Daytime, y = reorder(Crime, -Percent), fill = Percent)) #Add a (-) in front of percent
heat1 = base + geom_tile()
heat2 = heat1 +scale_fill_gradient(low = "yellow",
high = "purple")
heat2
Some work on the legend:
heat3 = heat2 + theme_classic()
heat4 = heat3 + theme(axis.text.x = element_text(angle = 0, vjust = 0.6),
legend.title = element_blank(), # no title for the legend
legend.position="top",
legend.direction="horizontal",
legend.key.width=unit(1, "cm"),
legend.key.height=unit(1, "cm"))
heat4 + labs(y="Crime")
Exercise 2:
Change the heatplot to ascending order, where intensity goes from yellow to purple. _____
Similar to the previous case, categorical variables can be used to understand the behavior of numerical variables. Here, the curiosity and experience of the analyst are critical in mining the data to reveal some insight, because numerical data have much wider value ranges than categorical data. The data used will be a good example of this.
In the previous data set we had a variable that records the number of days it takes someone to report a crime:
summary(crime$DaysToReport)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.00 0.00 7.65 1.00 36525.00 2
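The maximum of 36525 days is striking; a quick conversion shows it amounts to roughly a century, which suggests a data-entry artifact worth inspecting later:

```r
maxDays <- 36525
maxDays / 365  # about 100 years to report a crime
```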
There are several good categorical variables that can be used to study the behavior of this one. Let’s use precinct:
tapply(crime$DaysToReport,crime$Precinct,mean)
## EAST NORTH SOUTH SOUTHWEST WEST
## 7.216687 7.104350 7.219687 11.063208 4.897332
Above, you see the mean time (in days) it takes, per precinct, for people to report a crime. You may quickly sketch a plot in your mind just by reading those values, but the plot you imagine may be far from this one:
boxplot(DaysToReport~Precinct,data=crime)
The plot above does not give much insight; there is too much noise. A fuller summary tells us more to consider:
tapply(crime$DaysToReport,crime$Precinct,summary)
## $EAST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 7.22 1.00 36525.00
##
## $NORTH
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 7.104 1.000 14268.000
##
## $SOUTH
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 7.22 1.00 11292.00
##
## $SOUTHWEST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 11.06 1.00 11992.00
##
## $WEST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 4.897 1.000 16801.000
From the information above, you know that in each precinct 75% of crimes are reported within a day. If we consider that the expected behavior, we can omit those cases:
boxplot(DaysToReport~Precinct,data=crime,
subset = DaysToReport>1) #subsetting
No structure appears yet. Let me try different thresholds while showing how to divide the plotting screen:
par(mfrow=c(2,2)) # 2 rows, and two columns, by row:
boxplot(DaysToReport~Precinct,data=crime,subset = DaysToReport>=7,
main="One week or more")
boxplot(DaysToReport~Precinct,data=crime,subset = DaysToReport>=30,
main="One 30-day month or more")
boxplot(DaysToReport~Precinct,data=crime,subset = DaysToReport>=180,
main="180 days or more")
boxplot(DaysToReport~Precinct,data=crime,subset = DaysToReport>=365,
main="One year or more")
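One caveat about `par(mfrow = ...)`: it stays in effect for every later base-R plot, so the usual idiom is to save the previous settings and restore them when done. A minimal sketch:

```r
op <- par(mfrow = c(2, 2))  # change the layout, keeping the old settings in 'op'
# ... draw the four boxplots here ...
par(op)                     # restore the original single-panel layout
```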
Up to this point, you need to be planning a good story. The situation differs case by case, but let’s build our story from the crimes that took a year or longer to report.
First, let’s see how many cases we have per precinct:
crimeYear=crime[crime$DaysToReport>=365,]
tapply(crimeYear$DaysToReport,crimeYear$Precinct,length)
## EAST NORTH SOUTH SOUTHWEST WEST
## 263 555 248 230 362
The year the crime occurred would be another variable to consider, to see if we can filter more cases:
tapply(crimeYear$DaysToReport,crimeYear$year,length)
## 1908 1964 1973 1974 1975 1976 1977 1978 1979 1980 1985 1986 1987 1989 1990
## 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2
## 1991 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006
## 2 4 1 6 3 5 20 8 39 52 14 28 36 43 76
## 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
## 136 167 130 141 121 160 141 135 127 125 67
If we were to plot by year, several years before 2000 have too few cases (years with two cases or fewer may require a case study; some might even be data-entry mistakes). Let’s drop those old cases:
crimeY2000=crime[(crime$DaysToReport>=365) & (crime$year>=2000),]
tapply(crimeY2000$DaysToReport,crimeY2000$Precinct,length)
## EAST NORTH SOUTH SOUTHWEST WEST
## 257 543 237 214 358
Now, we see a better boxplot:
boxplot(DaysToReport~Precinct,data=crimeY2000,
main="One year or more (from 2000)")
It would certainly read better if the numerical units were years:
crimeY2000$YearsToReport=crimeY2000$DaysToReport/365
boxplot(YearsToReport~Precinct,data=crimeY2000,
main="One year or more (from 2000)")
This is a good data subset, but the distribution still looks pretty similar in every precinct. At this stage, you can try visualizing the outlier distributions, which start at around five years:
boxplot(YearsToReport~Precinct, data=crimeY2000, subset = YearsToReport>=5,
main="Five years or more (from 2000)")
If your story is about the similarity of distributions among precincts, you can start improving the last plots. But in case you need to show some variety you can try another relevant categorical variable.
Let’s try weekday and Occurred.DayTime:
par(mfrow=c(2,1))
boxplot(YearsToReport~weekday,data=crimeY2000,
main="One year or more BY WEEKDAY (from 2000)",las=2)
boxplot(YearsToReport~Occurred.DayTime,data=crimeY2000,
main="One year or more BY TIME CRIME OCCURRED (from 2000)",las=2)
Not much variety is perceived, so let’s try crimecat and year:
par(mfrow=c(2,1))
boxplot(YearsToReport~year,data=crimeY2000,
main="One year or more (from 2000)",las=2)
boxplot(YearsToReport~crimecat,data=crimeY2000,
main="One year or more (from 2000)",las=2)
The plot by year looks interesting, with a decreasing tendency, so we can spend some time improving it via ggplot:
# no missing:
crimeYearGG=crimeY2000[complete.cases(crimeY2000$YearsToReport),]
base = ggplot(crimeYearGG,aes(x=factor(year), y=YearsToReport))
box = base + geom_boxplot()
box
It may be that your audience does not know how to read a boxplot. It is a great plot, but it encodes a lot of statistical information. So we can go simpler and use lines connecting the easy-to-understand summary points of every boxplot:
base = ggplot(crimeYearGG,aes(x=factor(year), y=YearsToReport))
mins = base + stat_summary(fun.y=min, # function for 'y' is min()
geom="line",
show.legend = T,size=1,
aes(group=1,col='Min'))
mins # just the min values
Let me add the max values:
minsMaxs= mins + stat_summary(fun.y=max,
geom="line",
linetype='dashed',
size=1,show.legend = F,
aes(group=1,col='Max'))
minsMaxs
Adding the median:
minsMaxsMd= minsMaxs + stat_summary(fun.y=median,
geom="line",size=2,
aes(group=1,col='Median'))
minsMaxsMd
Let’s take control of the colors by customizing the legend:
# Change color of lines:
all1=minsMaxsMd + scale_colour_manual(name="Trends",
values=c(Max="blue", Median="black", Min="red")) # naming the values makes the label-color mapping explicit
all1
Now, let’s complete the story by showing how the filtered-out data behave, that is, the crimes that took less than a year to report, from 2000 onwards (we will not include earlier data):
# data preparation:
crimeWeek=crime[(crime$DaysToReport<365) & (crime$year>=2000),]
crimeWeek$WeeksToReport=crimeWeek$DaysToReport/7
crimeYearGG2=crimeWeek[complete.cases(crimeWeek$WeeksToReport) &complete.cases(crimeWeek$crimecat),]
#plotting it:
base = ggplot(crimeYearGG2,aes(x=factor(year), y=WeeksToReport))
mins = base + stat_summary(fun.y=min,size=1,
geom="line", linetype='dashed',show.legend = T,
aes(group=1,col='Min'))
minsMaxs= mins + stat_summary(fun.y=max,
geom="line",size=1,show.legend = F,
aes(group=1,col='Max'))
minsMaxsMd= minsMaxs + stat_summary(fun.y=median,
geom="line",size=2,
aes(group=1,col='Median'))
all2=minsMaxsMd + scale_colour_manual(name="Trends",
values=c("blue", "black","red")
)
all2
We also found variability in the type of crime, so we could try a story with it; first with Years to report (for crimes that took a year or longer to report, after year 2000):
base= ggplot(crimeYearGG,
aes(x = reorder(crimecat, YearsToReport, FUN = max), # reorder!
y=YearsToReport))
mins = base + stat_summary(fun.y=min,size=1,
geom="line", linetype='dashed',show.legend = T,
aes(group=1,col='Min'))
minsMaxs= mins + stat_summary(fun.y=max,
geom="line",size=1,show.legend = F,
aes(group=1,col='Max'))
minsMaxsMd= minsMaxs + stat_summary(fun.y=median, size=2,
geom="line",
aes(group=1,col='Median'))
all3=minsMaxsMd + scale_colour_manual(name="Trends",
values=c("blue", "black","red"))
all3 + coord_flip()
Now, for crimes that took less than a year to report, after year 2000:
base = ggplot(crimeYearGG2,
aes(x = reorder(crimecat, WeeksToReport, FUN = max),
y=WeeksToReport))
mins = base + stat_summary(fun.y=min,size=1,
geom="line", linetype='dashed',show.legend = T,
aes(group=1,col='Min'))
minsMaxs= mins + stat_summary(fun.y=max,
geom="line",size=1,show.legend = F,
aes(group=1,col='Max'))
minsMaxsMd= minsMaxs + stat_summary(fun.y=median,size=2,
geom="line",
aes(group=1,col='Median'))
all3=minsMaxsMd + scale_colour_manual(name="Trends",
values=c("blue", "black","red"))
all4 = all3 + theme(axis.title.y=element_text(vjust = 2.5), axis.title.x = element_text(vjust = -1),panel.background = element_rect(fill = 'gray95'), plot.title=(element_text(hjust=0.5)), plot.caption=element_text(vjust=-1.5)) + labs(title = "Weeks To Report Based on Crime", caption = "Source: City of Seattle Crime Data", x = "Type of Crime", y = "Reporting Time (Measured in Weeks)")
#all3+coord_flip()
all4 + coord_flip()
Exercise 3:
Complete the information needed in the previous plots.
It is very common in scientific texts to find the mean-difference test known as one-way ANOVA, which, beyond describing as we have done, seeks to show whether the mean of the numerical variable varies across the levels of the categorical variable.
#making a subset:
anovaData=crimeY2000[crimeY2000$YearsToReport>=5,]
#checking the mean per factor value:
tapply(anovaData$YearsToReport, anovaData$Precinct, mean,na.rm=T)
## EAST NORTH SOUTH SOUTHWEST WEST
## 8.402359 8.717236 8.278557 8.790947 8.853487
library(Rmisc)
## Loading required package: lattice
## Loading required package: plyr
group.CI(YearsToReport ~ Precinct,
data=anovaData,
ci = 0.95)
## Precinct YearsToReport.upper YearsToReport.mean YearsToReport.lower
## 1 EAST 9.424130 8.402359 7.380588
## 2 NORTH 9.206928 8.717236 8.227544
## 3 SOUTH 8.951394 8.278557 7.605721
## 4 SOUTHWEST 9.604103 8.790947 7.977791
## 5 WEST 9.844426 8.853487 7.862548
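`group.CI()` is essentially computing, per precinct, the classic t-based interval mean ± t(0.975, n-1) · s/√n. A self-contained sketch of that computation on toy numbers (not the crime data):

```r
x <- c(5, 7, 8, 9, 11)           # toy reporting times in years
n <- length(x)
se <- sd(x) / sqrt(n)            # standard error of the mean
tcrit <- qt(0.975, df = n - 1)   # two-sided 95% critical value
c(lower = mean(x) - tcrit * se,
  mean  = mean(x),
  upper = mean(x) + tcrit * se)
```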
anovaData=anovaData[complete.cases(anovaData),]
# introducing ggpubr
library(ggpubr)
##
## Attaching package: 'ggpubr'
## The following object is masked from 'package:plyr':
##
## mutate
ggline(data=anovaData,x = "Precinct", y = "YearsToReport",add = 'mean_ci',
error.plot = "pointrange") + scale_y_continuous(breaks=seq(7,10,0.5))
# Compute the analysis of variance
res.aov <- aov(YearsToReport ~ Precinct, data = anovaData)
# Summary of the analysis
summary(res.aov)[[1]]$Pr[1]
## [1] 0.7958203
# non parametric
kruskal.test(YearsToReport ~ Precinct, data = anovaData)
##
## Kruskal-Wallis rank sum test
##
## data: YearsToReport by Precinct
## Kruskal-Wallis chi-squared = 2.8431, df = 4, p-value = 0.5844
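Both p-values (0.796 and 0.584) are well above 0.05, so neither test rejects the hypothesis that reporting time behaves the same across precincts. As a self-contained illustration of the same workflow on simulated data where the groups genuinely differ:

```r
set.seed(42)
toy <- data.frame(group = rep(c("A", "B"), each = 30),
                  y = c(rnorm(30, mean = 0), rnorm(30, mean = 2)))
res <- aov(y ~ group, data = toy)
summary(res)[[1]]$Pr[1]  # tiny p-value: strong evidence of a group difference
```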
The study of bivariate relationships among numerical variables is known as correlation analysis. The data we have been using have few numerical columns, but I will produce two by aggregating the original data set by Neighborhood:
# 1. MEAN of days it takes to report a crime by neighborhood
daysByNeigh=tapply(crime$DaysToReport, crime$Neighborhood, mean,na.rm=T)
# you have:
head(daysByNeigh)
## ALASKA JUNCTION ALKI BALLARD NORTH BALLARD SOUTH
## 8.968437 14.463768 7.788203 5.262384
## BELLTOWN BITTERLAKE
## 5.012865 7.442241
# 2. PROPORTION of crimes by neighborhood
crimesByNeigh=tapply(crime$crimecat, crime$Neighborhood, length)%>%
prop.table()%>%
"*"(100)%>%
round(2)
head(crimesByNeigh)
## ALASKA JUNCTION ALKI BALLARD NORTH BALLARD SOUTH
## 1.35 0.49 2.13 2.96
## BELLTOWN BITTERLAKE
## 2.99 1.92
library(tibble)
as.data.frame(daysByNeigh)%>%rownames_to_column()
## rowname daysByNeigh
## 1 ALASKA JUNCTION 8.968437
## 2 ALKI 14.463768
## 3 BALLARD NORTH 7.788203
## 4 BALLARD SOUTH 5.262384
## 5 BELLTOWN 5.012865
## 6 BITTERLAKE 7.442241
## 7 BRIGHTON/DUNLAP 8.533382
## 8 CAPITOL HILL 4.936174
## 9 CENTRAL AREA/SQUIRE PARK 12.236743
## 10 CHINATOWN/INTERNATIONAL DISTRICT 3.269826
## 11 CLAREMONT/RAINIER VISTA 8.422852
## 12 COLUMBIA CITY 7.620129
## 13 COMMERCIAL DUWAMISH 14.918301
## 14 COMMERCIAL HARBOR ISLAND 8.711656
## 15 DOWNTOWN COMMERCIAL 3.563182
## 16 EASTLAKE - EAST 5.876214
## 17 EASTLAKE - WEST 4.783784
## 18 FAUNTLEROY SW 7.296050
## 19 FIRST HILL 7.113924
## 20 FREMONT 5.921374
## 21 GENESEE 7.522171
## 22 GEORGETOWN 5.054068
## 23 GREENWOOD 5.589741
## 24 HIGH POINT 17.031684
## 25 HIGHLAND PARK 8.523060
## 26 HILLMAN CITY 11.087692
## 27 JUDKINS PARK/NORTH BEACON HILL 8.485470
## 28 LAKECITY 11.369918
## 29 LAKEWOOD/SEWARD PARK 7.532511
## 30 MADISON PARK 6.662824
## 31 MADRONA/LESCHI 7.781613
## 32 MAGNOLIA 6.684218
## 33 MID BEACON HILL 6.706631
## 34 MILLER PARK 8.580793
## 35 MONTLAKE/PORTAGE BAY 6.749860
## 36 MORGAN 12.205063
## 37 MOUNT BAKER 5.378289
## 38 NEW HOLLY 4.888386
## 39 NORTH ADMIRAL 11.287831
## 40 NORTH BEACON HILL 7.155411
## 41 NORTH DELRIDGE 11.480690
## 42 NORTHGATE 6.537741
## 43 PHINNEY RIDGE 6.598978
## 44 PIGEON POINT 10.464837
## 45 PIONEER SQUARE 4.211331
## 46 QUEEN ANNE 7.578140
## 47 RAINIER BEACH 6.885420
## 48 RAINIER VIEW 9.779054
## 49 ROOSEVELT/RAVENNA 7.012361
## 50 ROXHILL/WESTWOOD/ARBOR HEIGHTS 9.711947
## 51 SANDPOINT 10.295633
## 52 SLU/CASCADE 5.187209
## 53 SODO 5.818391
## 54 SOUTH BEACON HILL 12.132686
## 55 SOUTH DELRIDGE 10.480810
## 56 SOUTH PARK 14.166928
## 57 UNIVERSITY 5.509077
## 58 WALLINGFORD 8.033284
Knowing how it works, we can create two data frames:
daysByNeigh=as.data.frame(daysByNeigh)%>%rownames_to_column()
crimesByNeigh=as.data.frame(crimesByNeigh)%>%rownames_to_column()
num_num=merge(daysByNeigh,crimesByNeigh) # 'rowname' is the key
head(num_num)
## rowname daysByNeigh crimesByNeigh
## 1 ALASKA JUNCTION 8.968437 1.35
## 2 ALKI 14.463768 0.49
## 3 BALLARD NORTH 7.788203 2.13
## 4 BALLARD SOUTH 5.262384 2.96
## 5 BELLTOWN 5.012865 2.99
## 6 BITTERLAKE 7.442241 1.92
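`merge()` matched the two data frames on their common column `rowname`. A minimal sketch of the same join on toy frames (hypothetical names, not the actual neighborhoods):

```r
a <- data.frame(rowname = c("N1", "N2"), days = c(8.9, 14.5))
b <- data.frame(rowname = c("N2", "N1"), pct = c(0.5, 1.4))
m <- merge(a, b)  # joins on the shared column 'rowname'; row order does not matter
m
```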
Once we have the data organized, the natural choice is the scatterplot:
base = ggplot(num_num, aes(daysByNeigh,crimesByNeigh))
plot1= base + geom_point()
plot1
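Since this is correlation analysis, it is worth quantifying the association alongside the scatterplot; `cor()` gives Pearson's r, and a rank-based Spearman version is one argument away. A sketch on toy vectors (made-up values, not the actual neighborhood data):

```r
days <- c(8.9, 14.5, 7.8, 5.3, 5.0, 7.4)  # toy average days to report
pct  <- c(1.4, 0.5, 2.1, 3.0, 3.0, 1.9)   # toy crime shares
cor(days, pct)                       # Pearson correlation (negative here)
cor(days, pct, method = "spearman")  # rank-based alternative
```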
We can improve the plot, this time introducing ggrepel:
library(ggrepel)
base = ggplot(num_num, aes(daysByNeigh,crimesByNeigh,
label=rowname)) # you need this aesthetics!
plot1= base + geom_point()
plot1 + geom_text_repel()
We can limit the labels, annotating only the neighborhoods that account for at least 5% of the crimes in the city:
base = ggplot(num_num, aes(daysByNeigh,crimesByNeigh,label=rowname))
plot1= base + geom_point()
plot1 + geom_text_repel(aes(label=ifelse(crimesByNeigh>=5,
num_num$rowname, "")))
Notice the difference without ggrepel:
base = ggplot(num_num, aes(daysByNeigh,crimesByNeigh))
scatp1 = base + geom_point()
scatp1 + geom_text(aes(label=ifelse(crimesByNeigh>=5,num_num$rowname, "")))
The good thing is that ggrepel makes better use of the space, even when highlighting a single case:
base = ggplot(num_num, aes(daysByNeigh,crimesByNeigh,label=rowname))
base + geom_point() + geom_text_repel(aes(label=ifelse(num_num$rowname=='NORTHGATE',
num_num$rowname, "")))
An alternative, to highlight overlapping of points:
base = ggplot(num_num, aes(daysByNeigh,crimesByNeigh))
scatp1 = base + geom_hex(bins = 10)
scatp2= scatp1 + geom_text_repel(aes(label=ifelse(crimesByNeigh>=5,
num_num$rowname,
"")))
scatp2 + scale_fill_distiller(palette ="Greys",direction=1) # try -1
## Warning: Computation failed in `stat_binhex()`:
## Package `hexbin` required for `stat_binhex`.
## Please install and try again.
As the warning indicates, geom_hex() depends on the hexbin package; run install.packages("hexbin") and re-execute the chunk to get the hexagon plot.
The palettes can be selected from the ColorBrewer website. Using the same palette as before, we can try a different plot (stat_density_2d):
base = ggplot(num_num, aes(daysByNeigh,crimesByNeigh))
scatp1 = base + stat_density_2d(aes(fill = ..density..),
geom = "raster", contour = FALSE)
scatp2=scatp1+geom_text_repel(aes(label=ifelse(crimesByNeigh>=5,
num_num$rowname, "")))
scatp3 = scatp2 + theme(legend.position='none', axis.title.y = element_text(size=10, vjust = 3), axis.title.x = element_text(size = 10, vjust = 0), title = element_text(size = 8)) + labs(x = "Average Days To Report", y = "Proportion of Crime", title = "Proportion of Crime and Average Days to Report Per Neighborhood", caption = "Source: City of Seattle Crime Data")
scatp4= scatp3 + scale_fill_distiller(palette="Greys", direction=1)
scatp4
The extra space you see can disappear using:
scatp5 = scatp4 + scale_x_continuous(expand = c(0, 0), breaks=seq(0,16,2)) +
scale_y_continuous(expand = c(0, 0), limits=c(0,5))
scatp5
## Warning: Removed 4 rows containing non-finite values (stat_density2d).
## Warning: Removed 4 rows containing missing values (geom_text_repel).
The warnings appear because limits=c(0,5) discards the points outside that range before plotting.
Exercise 4:
Complete the elements missing in the previous plots.